1 - Introduction


In this analysis, we will analyze Formula 1 data from the year 2,000 and onward. The reason for this time frame is because of two things: the first begin that there no active drivers that began their F1 careers before the year 2,000, and the second being that, beginning in the year 2,000, it was mandated that all racecars have the same engine size, which would change overtime.



Furthermore, our analysis dives into the relationship between a driver’s starting position and the likelihood of obtaining a certain finishing result. One would assume that the better one qualifies, the higher the change of winning the race. Also, it is known that there are a few big teams/drivers who have dominated the sport over the years and continue to do so, thus providing relatively consistent results. One can easily assume that certain teams are much more likely to win a Grand Prix thus those teams are more statistically significant in determining the result of a race. Because of this, we also investigate the relationship between the results/points given the constructors (teams) and drivers.

Lastly, we create several prediction models that allow for predicting a race results. We use ONLY the parameters that one would have available just before a race: the team name, the driver name, the era (engine size) and the starting position.


2 - Packages


knitr::opts_chunk$set(echo = TRUE)

library(dplyr)
library(ggplot2)
library(caret)
library(class)


3 - EDA


Starting Position Vs Race Result
Unfortunately, due to lack of programming knowledge, we were unable to create the plot that we wanted. However, we found exactly what we needed online and thus decided to include that plot with the source tagged just below it.

From the plot, one can see that there does in-fact appear to be a relationship between a driver’s starting position. It is clear that the higher one starts on the grid, the higher the probability of winning the race. In addition these findings, we also noted that the lower one starts on the grid, the higher the probability of not finishing the race (DFN), which could make sense if one assumes that drivers towards the back are driving much less conservatively and thus more aggressive than those towards the front.

Result Probability By Qualifying Position Source: MrBookman_LibraryCop”(Reddit User)
Click Here to Visit Source



Constructor/Driver Vs Race Result

From the results below, it is clear that Mercedes has the highest number of top positions throughout Formula 1 history, which is without a doubt the reason for their extreme success in F1 racing. Higher average position leads to higher average points and ultimately the season trophy. Remember, one does not have to win every race in order to win the season trophy - it’s all about points.



This plot below depicts the drivers’ count by race result. Given the results from the previous plot, one would expect a Mercedes driver to have the highest number of top-5 finishing positions. Sure and behold, the Kubica appears to have a higher number of top-5 finishing positions in comparison to the other drivers. Amon appears to be a more consistent driver but has less top-5 results.

4 - Modeling Data


The data includes all race results since the year 2,000 and the following attributes:
  • Driver name/id // currently only the 25 drivers with the most race results have a unique name. All other drivers have there names labeled ‘Other’.
  • Constructor name/id // currently only the 25 constructors with the most reace results have a unique name. All other constructors have there names labeled ‘Other’.
  • Race name/id.
  • Starting position labeled ‘grid’.
  • Finishing position labeled ‘positionOrder’.
  • The altitude of a track.
  • The era of engine size.
F1DataFrame = read.csv("ModelingData/F1ModelingCustomCategoricalVariables.csv")

head(F1DataFrame)
##   X driverId constructorId circuitId raceId year                name.x resultId
## 1 1        1             1         1     18 2008 Australian Grand Prix        1
## 2 2        1             1        35    875 2012     Korean Grand Prix    21601
## 3 3        1             1        11     46 2007  Hungarian Grand Prix      589
## 4 4        1           131        22   1026 2019   Japanese Grand Prix    24528
## 5 5        1             1         7    344 2010   Canadian Grand Prix    20491
## 6 6        1             1         4      5 2009    Spanish Grand Prix     7642
##   grid positionOrder                         name.y alt     name forename
## 1    1             1 Albert Park Grand Prix Circuit  10  McLaren    Lewis
## 2    3            10   Korean International Circuit   0  McLaren    Lewis
## 3    1             1                    Hungaroring 264  McLaren    Lewis
## 4    4             3                 Suzuka Circuit  45 Mercedes    Lewis
## 5    1             1      Circuit Gilles Villeneuve  13  McLaren    Lewis
## 6   14             9 Circuit de Barcelona-Catalunya 109  McLaren    Lewis
##    surname       fullname                  era
## 1 Hamilton Lewis Hamilton        90 degrees V8
## 2 Hamilton Lewis Hamilton        90 degrees V8
## 3 Hamilton Lewis Hamilton        90 degrees V8
## 4 Hamilton Lewis Hamilton 90 degrees V6 + MGUs
## 5 Hamilton Lewis Hamilton        90 degrees V8
## 6 Hamilton Lewis Hamilton        90 degrees V8


Logic to split data into training and testing sets.

modelVariables = c('positionOrder', 'grid', 'name', 'fullname', 'era')
modelDataframe = F1DataFrame[, modelVariables]

set.seed(4)

trainingIndices = sample(c(1:dim(modelDataframe)[1]), dim(modelDataframe)[1]*0.8)

trainingDataframe = modelDataframe[trainingIndices,]
testingDataframe = modelDataframe[-trainingIndices,]


5 - Model 1 (Liner Regression)


The Model Our initial model includes all of the data from 1989 and onward without any data cleaning.
mymodel<-lm(positionOrder ~ grid + name + fullname + era, data = F1DataFrame)

Residuals Vs Fitted Plot
  • Shows evidence of a negative slope with variance that seems to increase as fitted values increase.
  • This means the model is on average predicting worse results for drivers who finish with a good results.
  • As the fitted values increase it seems to get a wider difference in predicted vs actual results.
Q-Q Plot
  • Suggests a normal distribution can be assumed
Scale-Location Plot
  • Provides more evidence for normality.
  • There seems to be a patterns in starting at each discrete fitted value and traveling parallel to the Y axis with a sin function but with sharp 90 degree transitions.
  • Currently we don’t have an explanation for why this is.
mymodel<-lm(positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)


Model Summary Statistics

  • The model doesn’t not perform well With R-squared values at 0.25.
  • Surprisingly starting position doesn’t appear to have a strong of a correlation with the race result as we originally expected. One would expect there to be a strong correlation between starting position and finishing position.
summary(mymodel)
## 
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.3122  -4.6733  -0.9341   4.2487  21.6598 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  15.904265   0.847114  18.775  < 2e-16 ***
## grid                         -0.028042   0.009928  -2.825 0.004742 ** 
## nameAlphaTauri               -2.885055   0.792023  -3.643 0.000271 ***
## nameAlpine F1 Team           -3.321502   0.888146  -3.740 0.000185 ***
## nameArrows                    1.445664   0.696238   2.076 0.037876 *  
## nameBAR                      -1.136120   0.717750  -1.583 0.113469    
## nameBMW Sauber               -5.609388   0.830871  -6.751 1.52e-11 ***
## nameCaterham                  2.767392   0.822187   3.366 0.000765 ***
## nameFerrari                  -4.594213   0.572806  -8.021 1.14e-15 ***
## nameForce India              -2.050777   0.668144  -3.069 0.002149 ** 
## nameHaas F1 Team              0.068101   0.744726   0.091 0.927141    
## nameHonda                    -0.202469   0.887057  -0.228 0.819457    
## nameHRT                       5.360363   0.830318   6.456 1.11e-10 ***
## nameJaguar                    0.270974   0.769901   0.352 0.724874    
## nameJordan                    0.652334   0.634822   1.028 0.304162    
## nameLotus F1                 -2.651931   0.804287  -3.297 0.000979 ***
## nameMarussia                  2.957332   0.829335   3.566 0.000364 ***
## nameMcLaren                  -3.608018   0.574446  -6.281 3.47e-10 ***
## nameMercedes                 -5.287063   0.662341  -7.982 1.55e-15 ***
## nameMinardi                   2.152997   0.629614   3.420 0.000629 ***
## nameOther                     3.265153   0.568550   5.743 9.50e-09 ***
## nameRed Bull                 -4.788048   0.637921  -7.506 6.48e-14 ***
## nameRenault                  -2.495175   0.621994  -4.012 6.06e-05 ***
## nameSauber                   -0.573243   0.592537  -0.967 0.333342    
## nameToro Rosso               -0.559865   0.616570  -0.908 0.363877    
## nameToyota                   -1.596507   0.700748  -2.278 0.022724 *  
## nameWilliams                 -2.315616   0.572091  -4.048 5.20e-05 ***
## fullnameCarlos Sainz         -2.963388   0.843671  -3.512 0.000445 ***
## fullnameDaniel Ricciardo     -2.795338   0.796197  -3.511 0.000448 ***
## fullnameDavid Coulthard      -1.324745   0.788190  -1.681 0.092835 .  
## fullnameFelipe Massa         -3.282596   0.761831  -4.309 1.65e-05 ***
## fullnameFernando Alonso      -3.674028   0.742275  -4.950 7.52e-07 ***
## fullnameGiancarlo Fisichella -3.780827   0.759391  -4.979 6.48e-07 ***
## fullnameJarno Trulli         -2.536071   0.777466  -3.262 0.001109 ** 
## fullnameJenson Button        -3.723810   0.761335  -4.891 1.01e-06 ***
## fullnameKevin Magnussen      -1.425999   0.928718  -1.535 0.124696    
## fullnameKimi Räikkönen       -3.642620   0.748826  -4.864 1.16e-06 ***
## fullnameLance Stroll         -4.521595   0.889203  -5.085 3.72e-07 ***
## fullnameLewis Hamilton       -6.394946   0.799229  -8.001 1.33e-15 ***
## fullnameMark Webber          -2.428414   0.806392  -3.011 0.002605 ** 
## fullnameMax Verstappen       -5.027525   0.882818  -5.695 1.26e-08 ***
## fullnameMichael Schumacher   -4.954514   0.762672  -6.496 8.51e-11 ***
## fullnameNick Heidfeld        -1.755088   0.839081  -2.092 0.036485 *  
## fullnameNico Hülkenberg      -2.088303   0.767593  -2.721 0.006525 ** 
## fullnameNico Rosberg         -3.314610   0.841876  -3.937 8.28e-05 ***
## fullnameOther                -0.798539   0.647613  -1.233 0.217578    
## fullnameRalf Schumacher      -3.099923   0.828450  -3.742 0.000183 ***
## fullnameRomain Grosjean      -1.305760   0.919098  -1.421 0.155427    
## fullnameRubens Barrichello   -3.498207   0.754031  -4.639 3.53e-06 ***
## fullnameSebastian Vettel     -5.476040   0.762718  -7.180 7.34e-13 ***
## fullnameSergio Pérez         -4.412941   0.732878  -6.021 1.77e-09 ***
## fullnameValtteri Bottas      -4.581384   0.831545  -5.509 3.66e-08 ***
## era90 degrees V8              0.187351   0.193886   0.966 0.333913    
## eraUp to 12 cylinders        -1.030652   0.224903  -4.583 4.63e-06 ***
## eraV10                       -1.714270   0.247002  -6.940 4.08e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.475 on 13975 degrees of freedom
## Multiple R-squared:  0.2512, Adjusted R-squared:  0.2483 
## F-statistic: 86.82 on 54 and 13975 DF,  p-value: < 2.2e-16


Investigation
  • The grouping of low starting positions and high finishing positions during the era of up to 12 cylinders explains the irregularities from normality in the models residual plots and lack of statistical significance in the model for stating position.
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)

ggplot(data = F1DataFrame, aes(x = grid, y = positionOrder, colour = era)) + 
  geom_point(position = "jitter", size = .8) +
  labs(title="starting vs finishing position by era", 
     y="finishing position", 
     x="starting position")



  • Plots look similar to the full model as expected.
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)


6 - Model 2 (Liner Regression)


  • Taking out the data from the 12 cylinders era further improves the R-squared values to 0.39
  • Now starting position appears to be statistically significant which matches our intuition.
  • This model appears to be the best and provides the highest R-squared of all the models, but not the least amount of error.
F1DataFrame = F1DataFrame %>% filter(F1DataFrame$year >= "2000")
mymodel = lm(positionOrder ~ grid + name + fullname + era + grid:era, data = F1DataFrame)
simpleModel = lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)


Model 2 Summary Statistics

summary(mymodel)
## 
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era + grid:era, 
##     data = F1DataFrame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.305  -3.527  -1.062   2.713  19.492 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  10.18992    0.68222  14.936  < 2e-16 ***
## grid                          0.35685    0.01675  21.306  < 2e-16 ***
## nameAlphaTauri               -0.96330    0.60783  -1.585 0.113043    
## nameAlpine F1 Team           -2.01214    0.68050  -2.957 0.003116 ** 
## nameArrows                    1.68047    0.70947   2.369 0.017875 *  
## nameBAR                      -0.18805    0.58039  -0.324 0.745948    
## nameBMW Sauber               -3.24381    0.63884  -5.078 3.90e-07 ***
## nameCaterham                  1.19126    0.63429   1.878 0.060402 .  
## nameFerrari                  -3.45016    0.45662  -7.556 4.56e-14 ***
## nameForce India              -1.61940    0.51491  -3.145 0.001666 ** 
## nameHaas F1 Team              0.37112    0.56909   0.652 0.514330    
## nameHonda                    -0.41850    0.69192  -0.605 0.545302    
## nameHRT                       2.58153    0.65514   3.940 8.20e-05 ***
## nameJaguar                    0.77193    0.60813   1.269 0.204344    
## nameJordan                    1.19768    0.58538   2.046 0.040788 *  
## nameLotus F1                 -1.64706    0.61482  -2.679 0.007399 ** 
## nameMarussia                  1.13948    0.64156   1.776 0.075747 .  
## nameMcLaren                  -1.58150    0.45005  -3.514 0.000443 ***
## nameMercedes                 -2.96328    0.52171  -5.680 1.39e-08 ***
## nameMinardi                   1.17922    0.62166   1.897 0.057874 .  
## nameOther                     0.32732    0.46878   0.698 0.485051    
## nameRed Bull                 -2.67785    0.49441  -5.416 6.24e-08 ***
## nameRenault                  -1.48690    0.47883  -3.105 0.001907 ** 
## nameSauber                   -0.28114    0.47039  -0.598 0.550072    
## nameToro Rosso               -0.30402    0.47383  -0.642 0.521137    
## nameToyota                   -0.71284    0.54921  -1.298 0.194339    
## nameWilliams                 -0.19097    0.44755  -0.427 0.669601    
## fullnameCarlos Sainz         -1.92574    0.64850  -2.970 0.002990 ** 
## fullnameDaniel Ricciardo     -2.15690    0.61099  -3.530 0.000417 ***
## fullnameDavid Coulthard      -1.13276    0.66200  -1.711 0.087092 .  
## fullnameFelipe Massa         -2.00781    0.59393  -3.381 0.000726 ***
## fullnameFernando Alonso      -2.56675    0.57460  -4.467 8.03e-06 ***
## fullnameGiancarlo Fisichella -1.91949    0.60986  -3.147 0.001653 ** 
## fullnameJarno Trulli         -0.71872    0.62250  -1.155 0.248299    
## fullnameJenson Button        -2.81085    0.58919  -4.771 1.86e-06 ***
## fullnameKevin Magnussen      -1.31376    0.71091  -1.848 0.064634 .  
## fullnameKimi Räikkönen       -2.05271    0.58146  -3.530 0.000417 ***
## fullnameLance Stroll         -2.88345    0.68776  -4.192 2.79e-05 ***
## fullnameLewis Hamilton       -4.29182    0.62320  -6.887 6.09e-12 ***
## fullnameMark Webber          -1.49148    0.62177  -2.399 0.016470 *  
## fullnameMax Verstappen       -3.24201    0.67972  -4.770 1.87e-06 ***
## fullnameMichael Schumacher   -1.76263    0.66470  -2.652 0.008021 ** 
## fullnameNick Heidfeld        -1.50224    0.64358  -2.334 0.019606 *  
## fullnameNico Hülkenberg      -1.05863    0.58787  -1.801 0.071766 .  
## fullnameNico Rosberg         -2.43295    0.65492  -3.715 0.000205 ***
## fullnameOther                -1.04727    0.49630  -2.110 0.034871 *  
## fullnameRalf Schumacher      -2.27859    0.68996  -3.302 0.000962 ***
## fullnameRomain Grosjean      -0.69624    0.70355  -0.990 0.322388    
## fullnameRubens Barrichello   -1.97576    0.63877  -3.093 0.001987 ** 
## fullnameSebastian Vettel     -2.94648    0.58946  -4.999 5.88e-07 ***
## fullnameSergio Pérez         -2.77168    0.56127  -4.938 8.02e-07 ***
## fullnameValtteri Bottas      -3.38177    0.64398  -5.251 1.54e-07 ***
## era90 degrees V8             -0.19567    0.28781  -0.680 0.496616    
## eraV10                       -0.45452    0.35751  -1.271 0.203627    
## grid:era90 degrees V8         0.03029    0.02157   1.404 0.160292    
## grid:eraV10                  -0.08097    0.02809  -2.882 0.003959 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.939 on 9104 degrees of freedom
## Multiple R-squared:  0.367,  Adjusted R-squared:  0.3632 
## F-statistic: 95.97 on 55 and 9104 DF,  p-value: < 2.2e-16
summary(simpleModel)
## 
## Call:
## lm(formula = positionOrder ~ grid, data = F1DataFrame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.289  -3.824  -1.086   3.015  18.930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.070542   0.109514   46.30   <2e-16 ***
## grid        0.550774   0.008669   63.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.157 on 9158 degrees of freedom
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.3058 
## F-statistic:  4036 on 1 and 9158 DF,  p-value: < 2.2e-16


7 - Plots for Each Era


  • Seems to show a weak but significant correlation between better starting positions with better finishing positions.
  • All plots appear to have comparable distributions.
  • V10plot = F1DataFrame %>% filter(F1DataFrame$era == "V10") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) + 
      geom_point(position = "jitter", size = .8) +
      labs(title="Starting vs Finishing Position for V10 Era", 
         y="Finishing Position", 
         x="Starting Position") + theme(legend.position = "none")
    
    V8plot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V8") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) + 
      geom_point(position = "jitter", size = .8) +
      labs(title="Starting vs Finishing Position for V8 Era", 
         y="Finishing Position", 
         x="Starting Position") + theme(legend.position = "none")
    
    V6plot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V6 + MGUs") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) + 
      geom_point(position = "jitter", size = .8) +
      labs(title="Starting vs Finishing Position for V6 Era", 
         y="Finishing Position", 
         x="Starting Position") + theme(legend.position = "none")
    
    V10plot

    V8plot

    V6plot


    8 - Model 3 (KNN)


    • This model also performed well relative to the first two. It also provides the least amount of error, so we chose this as the best model - regardless of Model 2 having a higher R-squared.
    set.seed(1234)
    fitControl<-trainControl(method="repeatedcv",number=10,repeats=1)
    knn.fit<-train(positionOrder~.,
                   data=modelDataframe,
                   method="knn",
                   trControl=fitControl,
                   tuneGrid=expand.grid(k=c(1:10,20,30))
                   )
    knn.fit
    ## k-Nearest Neighbors 
    ## 
    ## 14030 samples
    ##     4 predictor
    ## 
    ## No pre-processing
    ## Resampling: Cross-Validated (10 fold, repeated 1 times) 
    ## Summary of sample sizes: 12626, 12629, 12627, 12626, 12627, 12627, ... 
    ## Resampling results across tuning parameters:
    ## 
    ##   k   RMSE      Rsquared   MAE     
    ##    1  5.895825  0.3962190  4.631315
    ##    2  5.738827  0.4196324  4.542079
    ##    3  5.666648  0.4309476  4.496665
    ##    4  5.614551  0.4395024  4.466336
    ##    5  5.583519  0.4447475  4.447944
    ##    6  5.566017  0.4475374  4.441407
    ##    7  5.561753  0.4481718  4.440443
    ##    8  5.551744  0.4499118  4.433137
    ##    9  5.543840  0.4512980  4.431609
    ##   10  5.538499  0.4521379  4.431004
    ##   20  5.519373  0.4552330  4.429668
    ##   30  5.528506  0.4537695  4.440500
    ## 
    ## RMSE was used to select the optimal model using the smallest value.
    ## The final value used for the model was k = 20.


    #THANK YOU!